Add torchcompile_xentropy executor #1655
Conversation
This executor is meant for fusions with a
@IvanYashchuk Adding cross entropy does indeed not help with RoPE, but it allows Thunder to use a very efficient fused cross-entropy Triton kernel whose performance currently cannot be matched by the other executors (not even APEX). From lines 211 and 212, it seems that while the set of ops was originally chosen for RoPE, the intent was that other ops could be added in the future, making this a fusion executor that comes to the rescue of nvFuser when it can improve performance. Do you think it would be better to create a separate executor entry for Inductor cross entropy only?
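To make the trade-off concrete, here is a toy, self-contained sketch of how op-claiming behaves when cross-entropy ops live in a separate executor entry rather than the existing `torchcompile_cat` list. All executor names and op names below are illustrative stand-ins, not Thunder's actual API or symbol names.

```python
# Toy model of executor op-claiming: each fusion executor advertises a set of
# ops it supports; an op in a trace is claimed by the first executor whose
# set contains it. Names here are illustrative, not Thunder's real API.

TORCHCOMPILE_CAT_OPS = {"cat", "slice", "transpose", "embedding"}
TORCHCOMPILE_XENTROPY_OPS = {"log_softmax", "nll_loss", "take_along_axis"}

def claim_ops(trace, executors):
    """Map each op name in `trace` to the first executor that supports it."""
    assignment = {}
    for op in trace:
        for name, supported in executors:
            if op in supported:
                assignment[op] = name
                break
        else:
            assignment[op] = "fallback"  # e.g. nvFuser or eager
    return assignment

trace = ["embedding", "cat", "log_softmax", "nll_loss"]
executors = [
    ("torchcompile_cat", TORCHCOMPILE_CAT_OPS),
    ("torchcompile_xentropy", TORCHCOMPILE_XENTROPY_OPS),
]
print(claim_ops(trace, executors))
```

With a separate entry, the cross-entropy ops are claimed independently of the RoPE-oriented op list, which is the design question being raised above.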
Reducing the model size to speed up CI and make the test work in environments with constrained memory. I think this change should be fine, since this test verifies functionality more than memory footprint. Let me know what you think. With this change the peak memory of the test is ~2.6 GB instead of ~14 GB.
I have a hard time seeing another torch.compile executor as the solution here.
I'd piggyback on this explanation of the differences, in lightning-thunder/thunder/executors/torch_compile.py, lines 209 to 215 at 880419d.
An issue I see is that we don't strictly enforce this rule, so in cases where the user specifies
My main issue with this is that we should really have one torchcompile fusion executor, in order not to split what could be a big fusion. The solution would be to separate the marking of things to be fused by the torchcompile executor from the fusing itself. But let's have that later.
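The "mark first, fuse contiguous runs later" idea can be sketched with a toy partitioner: a single pass marks which ops are fusable, then consecutive marked ops are grouped into one fusion region, so a large fusable run is never split between two executors. This is an illustrative sketch, not Thunder's actual fusion pass.

```python
# Sketch of a single fusion executor that first marks fusable ops, then
# groups *contiguous* marked ops into single fusion regions. Illustrative
# only; names are hypothetical, not Thunder's real internals.
from itertools import groupby

def fusion_regions(trace, fusable):
    """Group consecutive fusable ops into fusion regions; keep others solo."""
    regions = []
    for is_fusable, ops in groupby(trace, key=lambda op: op in fusable):
        ops = list(ops)
        if is_fusable:
            regions.append(("torchcompile_fusion", ops))
        else:
            regions.extend(("eager", [op]) for op in ops)
    return regions

fusable = {"cat", "slice", "log_softmax", "nll_loss"}
trace = ["embedding", "cat", "slice", "matmul", "log_softmax", "nll_loss"]
# cat+slice form one fusion region, log_softmax+nll_loss another
print(fusion_regions(trace, fusable))
```

Because grouping happens after marking, adding new ops (such as cross entropy) to the fusable set simply extends existing regions instead of creating a competing executor.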
Thank you @riccardofelluga
What does this PR do?
In an attempt to fix #1654 and partially #1552, this PR adds the necessary ops to the torchcompile_cat list to capture HF CausalLMLoss. Before merging, this PR needs testing with other models; I will post a message with the benchmark results.
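The PR's goal can be illustrated with a small coverage check: after extending the op list, a CausalLMLoss-style trace should be fully claimable by the fusion executor rather than falling back op by op. The op names below are illustrative stand-ins, not Thunder's real symbol names, and the added-op set is an assumption for the sketch.

```python
# Hypothetical check of the PR's goal: with the extended op list, a
# CausalLMLoss-style trace is fully capturable in one fusion region.
# Op names are illustrative stand-ins, not Thunder's real symbol names.

BASE_OPS = {"cat", "slice", "transpose"}
XENTROPY_OPS = {"reshape", "log_softmax", "nll_loss"}  # illustrative additions

def fully_captured(trace, supported):
    """True if every op in the trace is claimed by the fusion executor."""
    return all(op in supported for op in trace)

causal_lm_loss_trace = ["reshape", "log_softmax", "nll_loss"]
print(fully_captured(causal_lm_loss_trace, BASE_OPS))                 # False
print(fully_captured(causal_lm_loss_trace, BASE_OPS | XENTROPY_OPS))  # True
```

The benchmark runs mentioned above would then confirm that capturing the whole loss in one region actually translates into a speedup across models.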